Add automatic reference processing and markuplib support for DOCX #60
eduranm wants to merge 8 commits into scieloorg:main from
Conversation
Pull request overview
This PR adds the groundwork for structurally analyzing DOCX files (via markuplib) and integrates automatic reference processing into the markup_doc flow, triggering it when a document is created/uploaded.
Changes:
- Registers the markup_doc and markuplib apps and adds base utilities for structural analysis of DOCX files.
- Adds Celery tasks to process the uploaded DOCX, detect references, and persist the processed content in the document.
- Adds Wagtail hooks/admin for the upload flow and for syncing collections/journals from the API.
Reviewed changes
Copilot reviewed 16 out of 23 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| model_ai/llama.py | Adjusts the Gemini flow in LlamaService (includes a fixed pause after each response). |
| markuplib/function_docx.py | New utilities for opening and extracting content/structure from DOCX files. |
| markuplib/__init__.py | Initialization of the markuplib package. |
| markup_doc/wagtail_hooks.py | Wagtail ViewSets and hooks for upload/editing and triggering automatic processing. |
| markup_doc/tests.py | Test file (placeholder). |
| markup_doc/tasks.py | Celery task to process the DOCX and structure content + references. |
| markup_doc/sync_api.py | Synchronization of collections and journals from the SciELO Core API. |
| markup_doc/models.py | Models and StreamFields to persist front/body/back and metadata. |
| markup_doc/migrations/__init__.py | Initialization of the migrations module. |
| markup_doc/migrations/0001_initial.py | Initial migration for the markup_doc models. |
| markup_doc/migrations/0002_alter_articledocx_estatus_and_more.py | Field/choice adjustments for estatus. |
| markup_doc/marker.py | Utilities for LLM-based markup (article/references). |
| markup_doc/labeling_utils.py | Utilities for segmentation, APA citation extraction, and mapping/labeling. |
| markup_doc/forms.py | Form base (placeholder). |
| markup_doc/choices.py | Choices/base label structure and ordering rules. |
| markup_doc/apps.py | AppConfig for markup_doc. |
| markup_doc/admin.py | Django admin (placeholder). |
| markup_doc/__init__.py | Initialization of the markup_doc package. |
| fixtures/e14790.docx | Sample DOCX for manual testing. |
| config/settings/base.py | Registers markup_doc and markuplib in INSTALLED_APPS. |
```python
if model.name_file:
    user = User.objects.get(pk=user_id)
    refresh = RefreshToken.for_user(user)
    access_token = refresh.access_token

    #url = "http://172.17.0.1:8400/api/v1/mix_citation/reference/"
    #url = "http://172.17.0.1:8009/api/v1/mix_citation/reference/"

    # FIXME: Hardcoded URL
    url = "http://django:8000/api/v1/reference/"

headers = {
    'Authorization': f'Bearer {access_token}',
    'Content-Type': 'application/json'
}

response = requests.post(url, json=payload, headers=headers)
```
In process_reference(), access_token and url are only set inside if model.name_file:, but headers and requests.post() run unconditionally. If name_file is blank (e.g., using a remote API), this will raise UnboundLocalError. Initialize url/access_token for both branches or return/raise when the required config is missing.
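One way to apply the fix the review asks for is to resolve the URL and headers in a single helper that either succeeds for both branches or fails fast, so the later `requests.post()` can never see undefined names. The helper below is a hypothetical sketch (its name and the `access_token_factory` parameter are not in the PR); it keeps the hardcoded URL the FIXME already flags.

```python
def build_request_config(name_file, access_token_factory):
    """Return (url, headers) or raise when required config is missing.

    Hypothetical helper: both branches either produce url/headers or
    fail fast, avoiding the UnboundLocalError the review describes.
    """
    if not name_file:
        raise ValueError("name_file is required to resolve the reference API URL")
    access_token = access_token_factory()
    url = "http://django:8000/api/v1/reference/"  # still hardcoded, as in the PR
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json",
    }
    return url, headers
```

`process_reference()` would then call this once and post with the returned values, instead of building `headers` from a possibly unset `access_token`.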
```python
def match_section(item, sections):
    return {'label': '<sec>', 'body': True} if (
        item.get('font_size') == sections[0].get('size') and
        item.get('bold') == sections[0].get('bold') and
        item.get('text', '').isupper() == sections[0].get('isupper')
    ) else None


def match_subsection(item, sections):
    return {'label': '<sub-sec>', 'body': True} if (
        item.get('font_size') == sections[1].get('size') and
        item.get('bold') == sections[1].get('bold') and
        item.get('text', '').isupper() == sections[1].get('isupper')
    ) else None
```
match_section()/match_subsection() index sections[0] and sections[1] without checking length. If sections has fewer than 2 entries (common for short/simple documents), this will raise IndexError. Add guards (e.g., if len(sections) > 0/1) before indexing.
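A guarded version could look like the sketch below (the `_safe` name is only for illustration): check the list length before indexing and return `None` when the expected section style is absent, which matches how the callers already treat a non-match.

```python
def match_section_safe(item, sections):
    # Guard for short/simple documents where no section style was
    # detected; without this, sections[0] raises IndexError.
    if not sections:
        return None
    ref = sections[0]
    if (item.get('font_size') == ref.get('size')
            and item.get('bold') == ref.get('bold')
            and item.get('text', '').isupper() == ref.get('isupper')):
        return {'label': '<sec>', 'body': True}
    return None
```

`match_subsection` would get the same treatment with `if len(sections) < 2: return None` before reading `sections[1]`.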
```python
if not result:
    result = {'label': '<p>', 'body': state['body'], 'back': state['back']}
    state['label'] = result.get('label')
    state['body'] = result.get('body')
    state['back'] = result.get('back')

if result:
    pass
else:
    if state.get('label_next'):
        if state.get('repeat'):
            result = match_by_regex(item.get('text'), order_labels)
            if result:
                state['label'] = result[0]
            else:
                result = match_by_style_and_size(item, order_labels, style='bold')
                if result:
                    state['label'] = result[0]
                    state['repeat'] = None
                    state['reset'] = None
                    state['label_next'] = result[1].get("next")
                    state['body'] = result[1].get("size") == 16
                    if state['body'] and re.search(r"^(refer)", item.get('text').lower()):
                        state['body'] = False
                        state['back'] = True
        if not result:
            result = match_next_label(item, state['label_next'], order_labels)
            if result:
                state['label'] = result[0]
                state['label_next_reset'] = result[1].get("next")
                state['reset'] = result[1].get("reset", False)
                state['repeat'] = result[1].get("repeat", False)
            else:
                result = match_by_style_and_size(item, order_labels, style='bold')
                if result:
                    state['label'] = result[0]
                    state['label_next'] = result[1].get("next")
                    if state.get('body') and re.search(r"^(refer)", item.get('text').lower()):
                        state['body'] = False
                        state['back'] = True
                else:
                    result = match_by_style_and_size(item, order_labels, style='italic')
                    if result:
                        state['label'] = re.sub(r"-\d+", "", result[0])
                        state['label_next'] = result[1].get("next")
    else:
        result = match_by_regex(item.get('text'), order_labels)
        if result:
            state['label'] = result[0]
        else:
            result = match_paragraph(item, order_labels)
            if result:
                state['label'] = result[0]
```
In create_labeled_object2(), result is forced to a non-empty dict at line 700 and then the else: branch (which contains most of the labeling logic) becomes unreachable because of if result: pass. This makes the function effectively label everything as <p> unless it matches the section/subsection checks. Rework the control flow so the detailed matching logic can run when appropriate.
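The fix amounts to making `<p>` the fallback that runs last rather than a default assigned first. A minimal sketch of that pattern (a hypothetical simplification of `create_labeled_object2()`, not the PR's actual code):

```python
def label_item(item, matchers, default_label='<p>'):
    """Try each matcher in order; only fall back to the default when
    none matched. This keeps the detailed matching logic reachable,
    unlike pre-filling result and then skipping the else branch."""
    for matcher in matchers:
        result = matcher(item)
        if result:
            return result
    return {'label': default_label}
```

In the real function the matcher chain would be the existing `match_by_regex` / `match_by_style_and_size` / `match_next_label` cascade, with the state updates kept inside each branch.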
```python
obj['type'] = 'aff_paragraph'

if re.search(r"^(translation)", item.get('text').lower()):
    state['label'] = '<translate-fron>'
```
state['label'] = '<translate-fron>' looks like a typo (missing 't') and will produce a label that doesn't match the choices (<translate-front>). Use the correct label string so downstream logic can recognize it.
Suggested change:
```diff
-    state['label'] = '<translate-fron>'
+    state['label'] = '<translate-front>'
```
```python
response_gemini = model.generate_content(user_input).text
time.sleep(15)
return response_gemini
```
time.sleep(15) after every Gemini call will throttle all reference processing and can tie up Celery workers even when the request succeeds. Consider removing the unconditional sleep and instead implement retry/backoff only when Gemini returns rate-limit/transient errors (e.g., 429/503), ideally with jitter.
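A retry-with-backoff wrapper along these lines would replace the unconditional sleep. This is a hedged sketch: `call` stands in for `model.generate_content(...)` and `is_transient` for whatever predicate classifies 429/503-style errors in the real client.

```python
import random
import time

def call_with_backoff(call, is_transient, max_retries=4, base_delay=1.0):
    """Retry only on transient/rate-limit errors, with exponential
    backoff plus jitter, instead of sleeping after every call."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception as exc:
            if not is_transient(exc) or attempt == max_retries - 1:
                raise
            # 1s, 2s, 4s, ... plus jitter to avoid thundering herds.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
```

Successful calls return immediately, so Celery workers are only delayed when Gemini actually pushes back; Celery's own `retry_backoff`/`retry_jitter` task options are another way to get the same behavior.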
```python
def update(cls, title, estatus):
    try:
        obj = cls.get(title=title)
    except (cls.DoesNotExist, ValueError):
        pass

    obj.estatus = estatus
    obj.save()
    return obj
```
In update(), if get() raises DoesNotExist, the exception is swallowed and obj is left undefined, but the code still tries to set obj.estatus. Either re-raise/return early when not found, or create the object as appropriate.
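The fail-fast shape of the fix can be sketched without Django, using a `lookup` callable that stands in for `cls.get()` and a `NotFound` exception standing in for `DoesNotExist` (all names here are hypothetical):

```python
class NotFound(Exception):
    pass

def update_status(lookup, title, estatus):
    """Fail fast instead of swallowing the not-found case and then
    touching an undefined obj."""
    try:
        obj = lookup(title)
    except NotFound:
        # Either re-raise with context (as here), or create the object
        # if update() is really meant to be an upsert.
        raise NotFound(f"No document with title {title!r}") from None
    obj['estatus'] = estatus
    return obj
```

In the model itself this would mean re-raising (or returning early) inside the `except (cls.DoesNotExist, ValueError):` block rather than `pass`.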
```python
if is_numPr:
    numPr = paragraph.find('.//w:numPr', namespaces=paragraph.nsmap)
    numId = numPr.find('.//w:numId', namespaces=paragraph.nsmap).get(namespaces_p + 'val')
    type = [(key, objt) for key, objt in list_types.items() if objt['numId'] == numId]

    # It is a different list
    if numId != current_num_id:
        current_num_id = numId
        if len(current_list) > 0:
            current_list.append('[/list]')
            objl = {}
            objl['type'] = 'list'
            objl['list'] = '\n'.join(current_list)
            current_list = []
            content.append(objl)
        list_type = 'bullet'
        if type[0][1][str(0)] == 'decimal':
            list_type = 'order'
```
extract_numbering_info() can return None when word/numbering.xml is missing, but extractContent() unconditionally does list_types.items() and later indexes type[0]. This will raise at runtime for DOCX files without numbering or with unexpected numId mappings; handle list_types is None and the empty-match case before using it.
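Both guards can be folded into one small lookup, sketched below with a hypothetical helper name; defaulting to `'bullet'` when numbering data is missing is an assumption, chosen because it is the value the PR already starts from.

```python
def resolve_list_type(list_types, num_id):
    """Guarded lookup: tolerate a missing word/numbering.xml
    (list_types is None) and a numId with no match, instead of
    indexing an empty result."""
    if not list_types:
        return 'bullet'
    matches = [obj for obj in list_types.values() if obj.get('numId') == num_id]
    if not matches:
        return 'bullet'
    # The PR stores the level-0 format under the string key '0'.
    return 'order' if matches[0].get('0') == 'decimal' else 'bullet'
```

`extractContent()` would call this instead of building `type` and indexing `type[0][1][str(0)]` directly.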
```python
else:
    obj['spacing'] = False

clean_text = clean_labels(child.text)
```
clean_labels(child.text) will fail when child.text is None (common for <w:r> elements which usually contain <w:t> children). This will raise a TypeError in re.sub. Extract text from the run's <w:t> nodes (or guard against None) before calling clean_labels.
Suggested change:
```diff
-clean_text = clean_labels(child.text)
+run_text_nodes = child.xpath('.//w:t/text()', namespaces=child.nsmap)
+raw_text = ''.join(run_text_nodes) if run_text_nodes else (child.text or '')
+clean_text = clean_labels(raw_text)
```
```python
def get_labels(title, user_id):
    article_docx = UploadDocx.objects.get(title=title)
```
UploadDocx.objects.get(title=title) relies on non-unique titles and can raise MultipleObjectsReturned / select the wrong row. Prefer passing a primary key to the task and fetching by pk.
Suggested change:
```diff
-def get_labels(title, user_id):
-    article_docx = UploadDocx.objects.get(title=title)
+def get_labels(upload_docx_id, user_id):
+    article_docx = UploadDocx.objects.get(pk=upload_docx_id)
```
```python
# FIXME: This function always fetches the first LlamaModel instance.
model_ai = LlamaModel.objects.first()

if model_ai.api_key_gemini:
```
get_llm_model_name() assumes a LlamaModel row always exists; if the table is empty, model_ai will be None and model_ai.api_key_gemini will raise. Guard with if model_ai and model_ai.api_key_gemini: (and decide on a sensible default when it is None).
Suggested change:
```diff
-if model_ai.api_key_gemini:
+if model_ai and model_ai.api_key_gemini:
```
What does this PR do?
Adds the groundwork for automatically processing bibliographic references within markup_doc and incorporates markuplib for structural reading of DOCX files. It includes:
- the markuplib app;
- markuplib/ with utilities for analyzing DOCX files;
- markup_doc for processing and marking up references.
Where could the review start?
By commits
How could this be tested manually?
- Bring up the environment;
- Upload a DOCX through the markup_doc flow;
- Verify that the document moves to the PROCESSING state;
- Once finished, check that the references are added, structured, to the processed document.
Any context you want to give?
It focuses on automatic reference processing and structural reading of the DOCX, leaving the groundwork ready to continue with the front matter, body text, and XML output.
Screenshots
N/A
What are the relevant tickets?
#59
References